Introduction to Single-cell analysis

Robel A. Tesfaye

Main steps in single-cell analysis

There is a complete book called OSCA from Bioconductor on the use of bioconductor packages to perform single-cell analysis…

# Download the Github repo following the link above and 
# insert the path to the directory
BiocManager::install(
  remotes::local_package_deps(pkgdir = "~/mount/NAS_CUVIER/grpCuvier/rtesfaye/OSCA.intro-master/",
                              dependencies = TRUE))
bookdown::render_book(input = "~/mount/NAS_CUVIER/grpCuvier/rtesfaye/OSCA.intro-master/inst/book/",output_dir = "~/mount/NAS_CUVIER/grpCuvier/rtesfaye/OSCA_book/")

The major steps of single-cell rna-seq (even scATAQ-Seq) analysis can be summarized by the following commands using Seurat functions:

  • After quality filtering: to keep live cells (not dying cells, undamaged, non empty droplets), non doublets etc…

  • Find the most variable features, to limit memory usage and supposing that the most interesting biological variations are found in the most variable features…

library(Seurat)
snare <- FindVariableFeatures(snare, nfeatures = 3000)
  • Normalize the data: by default this is LogNormalize (feature expression per cell normalized to total expression times 10 000 and log transformed)

  • Scale the data: to have a mean of 0 and sd of 1 across cells (standardization). The aim is to give equal weights to highly expressed genes and lowly expressed genes in downstream analyses (Differential expression, PCA …)

snare <- NormalizeData(snare)
snare <- ScaleData(snare)
  • Linear dimensionality reduction: typically PCA in scRNA and LSI in scATAC.
snare <- RunPCA(snare, npcs = 30)
  • Non-linear dimensionality reduction: UMAP and TSNE could then be used to reduce even more the dimensions. Because your PCA will generally show that after PC2/PC3 you’ll still have significant variation explained. The variation explained could be reduced to the point that you can ignore after PC10. So to include all this variation, further dimensionality reduction is necessary. The non-linear reduction methods will do that by preserving the distances between your cells in the 10 or more PCs you choose to work with.
snare <- RunUMAP(snare, dims = 1:30, reduction.name = "umap.rna")
  • This is clustering between cells. Finding communities of cells/quasi-cliques based on the distances (euclidean) between cells in the PCA (number of PCs to specify)…
snare <- FindNeighbors(snare, dims = 1:30)
snare <- FindClusters(snare, resolution = 0.5, algorithm = 3)
  • You can perform after this differential expression analysis between your clusters…

  • Trajectory analysis and pseudo-time…

Download an example dataset from 10x and import in R

wget -O /home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
wget -O /home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
wget -O /home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38)
library(magrittr)
library(tidyverse)
# The atac fragments file contains all the fragments, so it's pretty heavy
# in memory, that's why it's not loaded as an R object
fragpath <- "/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"

# counts <- Read10X_h5("/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
counts = readRDS("/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/counts.RDATA")

Let’s see the contents of counts produced by Read10X_h5

class(counts)
[1] "list"
head(names(counts))
[1] "Gene Expression" "Peaks"          
class(counts[[1]])
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
head(colnames(counts[[1]]))
[1] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1"
[4] "AAACAGCCACACTAAT-1" "AAACAGCCACCAACCG-1" "AAACAGCCAGGATAAC-1"
all(colnames(counts[[1]])==colnames(counts[[2]]))
[1] TRUE
head(rownames(counts[[1]]))
[1] "MIR1302-2HG" "FAM138A"     "OR4F5"       "AL627309.1"  "AL627309.3" 
[6] "AL627309.2" 
head(rownames(counts[[2]]))
[1] "chr1:10109-10357"   "chr1:180730-181630" "chr1:191491-191736"
[4] "chr1:267816-268196" "chr1:586028-586373" "chr1:629721-630172"

Single-cell contains many 0 values. Representing the data in dgcMatrix or similar classes that can contain sparse data saves a lot of memory space. In these classes, 0 values become dots “.”

object.size(as.matrix(counts[[1]]))
# 3490583256 bytes
object.size(counts[[1]])
# 285677928 bytes
# get gene annotations for hg38
# annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
# annotation %<>% `seqlevelsStyle<-`("UCSC")
annotation = readRDS(file = "/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/annotation.RDATA")

# create a Seurat object containing the RNA adata
pbmc <- CreateSeuratObject(
  counts = counts$`Gene Expression`,
  assay = "RNA"
)

# create ATAC assay and add it to the object
pbmc[["ATAC"]] <- CreateChromatinAssay(
  counts = counts$Peaks,
  sep = c(":", "-"),
  fragments = fragpath,
  annotation = annotation
)
rm(counts)
str(pbmc)
Formal class 'Seurat' [package "SeuratObject"] with 13 slots
  ..@ assays      :List of 2
  .. ..$ RNA :Formal class 'Assay' [package "SeuratObject"] with 8 slots
  .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:23508033] 17 48 53 59 86 109 113 137 168 170 ...
  .. .. .. .. .. ..@ p       : int [1:11910] 0 3308 5204 8108 8954 11236 12589 15650 17341 20369 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 36601 11909
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
  .. .. .. .. .. .. ..$ : chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  .. .. .. .. .. ..@ x       : num [1:23508033] 1 1 1 1 1 1 1 1 1 6 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:23508033] 17 48 53 59 86 109 113 137 168 170 ...
  .. .. .. .. .. ..@ p       : int [1:11910] 0 3308 5204 8108 8954 11236 12589 15650 17341 20369 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 36601 11909
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
  .. .. .. .. .. .. ..$ : chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  .. .. .. .. .. ..@ x       : num [1:23508033] 1 1 1 1 1 1 1 1 1 6 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..@ scale.data   : num[0 , 0 ] 
  .. .. .. ..@ assay.orig   : NULL
  .. .. .. ..@ var.features : logi(0) 
  .. .. .. ..@ meta.features:'data.frame':  36601 obs. of  0 variables
  .. .. .. ..@ misc         :List of 1
  .. .. .. .. ..$ calcN: logi TRUE
  .. .. .. ..@ key          : chr "rna_"
  .. ..$ ATAC:Formal class 'ChromatinAssay' [package "Signac"] with 16 slots
  .. .. .. ..@ ranges            :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. .. ..@ values         : Factor w/ 33 levels "chr1","chr2",..: 1 10 11 12 13 14 15 16 17 18 ...
  .. .. .. .. .. .. .. ..@ lengths        : int [1:33] 10538 4967 5249 5670 2410 3734 3513 4033 5766 2013 ...
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. .. .. .. ..@ start          : int [1:108377] 10109 180730 191491 267816 586028 629721 633793 777634 816881 819912 ...
  .. .. .. .. .. .. .. ..@ width          : int [1:108377] 249 901 246 381 346 452 472 2293 767 3589 ...
  .. .. .. .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 3
  .. .. .. .. .. .. .. ..@ lengths        : int 108377
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. .. .. .. ..@ seqnames   : chr [1:33] "chr1" "chr2" "chr3" "chr4" ...
  .. .. .. .. .. .. .. ..@ seqlengths : int [1:33] NA NA NA NA NA NA NA NA NA NA ...
  .. .. .. .. .. .. .. ..@ is_circular: logi [1:33] NA NA NA NA NA NA ...
  .. .. .. .. .. .. .. ..@ genome     : chr [1:33] NA NA NA NA ...
  .. .. .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. .. .. .. ..@ nrows          : int 108377
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. .. .. ..@ listData       : Named list()
  .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ motifs            : NULL
  .. .. .. ..@ fragments         :List of 1
  .. .. .. .. ..$ :Formal class 'Fragment' [package "Signac"] with 3 slots
  .. .. .. .. .. .. ..@ path : chr "/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"
  .. .. .. .. .. .. ..@ hash : chr [1:2] "a959ef83dfb9cae6ff73ab0147d547d1" "df967acbe28da89aed9cfdd89370b7af"
  .. .. .. .. .. .. ..@ cells: Named chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  .. .. .. .. .. .. .. ..- attr(*, "names")= chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  .. .. .. ..@ seqinfo           : NULL
  .. .. .. ..@ annotation        :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. .. ..@ values         : Factor w/ 25 levels "chrX","chr20",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. .. .. .. .. ..@ lengths        : int [1:25] 92459 65287 276555 135207 190802 141516 182677 182178 116305 194669 ...
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. .. .. .. ..@ start          : int [1:3021151] 276322 276324 276353 281055 281192 281194 281462 281482 281482 281482 ...
  .. .. .. .. .. .. .. ..@ width          : int [1:3021151] 73 71 42 67 493 63 223 203 203 203 ...
  .. .. .. .. .. .. .. ..@ NAMES          : chr [1:3021151] "ENSE00001489430" "ENSE00001536003" "ENSE00002160563" "ENSE00001750899" ...
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 1 2 1 2 1 2 1 2 1 2 ...
  .. .. .. .. .. .. .. ..@ lengths        : int [1:108826] 57 70 26 32 119 1 72 12 5 50 ...
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. .. .. .. ..@ seqnames   : chr [1:25] "chrX" "chr20" "chr1" "chr6" ...
  .. .. .. .. .. .. .. ..@ seqlengths : int [1:25] 156040895 64444167 248956422 170805979 198295559 159345973 133275309 135086622 190214555 83257441 ...
  .. .. .. .. .. .. .. ..@ is_circular: logi [1:25] FALSE FALSE FALSE FALSE FALSE FALSE ...
  .. .. .. .. .. .. .. ..@ genome     : chr [1:25] "hg38" "hg38" "hg38" "hg38" ...
  .. .. .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. .. .. .. ..@ nrows          : int 3021151
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. .. .. ..@ listData       :List of 5
  .. .. .. .. .. .. .. .. ..$ tx_id       : chr [1:3021151] "ENST00000399012" "ENST00000484611" "ENST00000430923" "ENST00000445062" ...
  .. .. .. .. .. .. .. .. ..$ gene_name   : chr [1:3021151] "PLCXD1" "PLCXD1" "PLCXD1" "PLCXD1" ...
  .. .. .. .. .. .. .. .. ..$ gene_id     : chr [1:3021151] "ENSG00000182378" "ENSG00000182378" "ENSG00000182378" "ENSG00000182378" ...
  .. .. .. .. .. .. .. .. ..$ gene_biotype: chr [1:3021151] "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
  .. .. .. .. .. .. .. .. ..$ type        : Factor w/ 4 levels "cds","exon","gap",..: 2 2 2 2 2 2 2 2 2 2 ...
  .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ bias              : NULL
  .. .. .. ..@ positionEnrichment: list()
  .. .. .. ..@ links             :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. .. ..@ values         : Factor w/ 0 levels: 
  .. .. .. .. .. .. .. ..@ lengths        : int(0) 
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. .. .. .. ..@ start          : int(0) 
  .. .. .. .. .. .. .. ..@ width          : int(0) 
  .. .. .. .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 
  .. .. .. .. .. .. .. ..@ lengths        : int(0) 
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. .. .. .. ..@ seqnames   : chr(0) 
  .. .. .. .. .. .. .. ..@ seqlengths : int(0) 
  .. .. .. .. .. .. .. ..@ is_circular: logi(0) 
  .. .. .. .. .. .. .. ..@ genome     : chr(0) 
  .. .. .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. .. .. .. ..@ nrows          : int 0
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. .. .. ..@ listData       : Named list()
  .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ counts            :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:85596796] 12 16 25 27 28 30 34 37 38 39 ...
  .. .. .. .. .. ..@ p       : int [1:11910] 0 13878 21131 27659 28565 31888 36155 47788 55033 63635 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 108377 11909
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:108377] "chr1-10109-10357" "chr1-180730-181630" "chr1-191491-191736" "chr1-267816-268196" ...
  .. .. .. .. .. .. ..$ : chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  .. .. .. .. .. ..@ x       : num [1:85596796] 1 2 2 2 2 2 4 6 8 10 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..@ data              :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:85596796] 12 16 25 27 28 30 34 37 38 39 ...
  .. .. .. .. .. ..@ p       : int [1:11910] 0 13878 21131 27659 28565 31888 36155 47788 55033 63635 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 108377 11909
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:108377] "chr1-10109-10357" "chr1-180730-181630" "chr1-191491-191736" "chr1-267816-268196" ...
  .. .. .. .. .. .. ..$ : chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  .. .. .. .. .. ..@ x       : num [1:85596796] 1 2 2 2 2 2 4 6 8 10 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..@ scale.data        : num[0 , 0 ] 
  .. .. .. ..@ assay.orig        : NULL
  .. .. .. ..@ var.features      : logi(0) 
  .. .. .. ..@ meta.features     :'data.frame': 108377 obs. of  0 variables
  .. .. .. ..@ misc              : Named list()
  .. .. .. ..@ key               : chr "atac_"
  ..@ meta.data   :'data.frame':    11909 obs. of  5 variables:
  .. ..$ orig.ident   : Factor w/ 1 level "SeuratProject": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ nCount_RNA   : num [1:11909] 8380 3771 6876 1733 5415 ...
  .. ..$ nFeature_RNA : int [1:11909] 3308 1896 2904 846 2282 1353 3061 1691 3028 1783 ...
  .. ..$ nCount_ATAC  : num [1:11909] 55582 20495 16674 2007 7658 ...
  .. ..$ nFeature_ATAC: int [1:11909] 13878 7253 6528 906 3323 4267 11633 7245 8602 7719 ...
  ..@ active.assay: chr "RNA"
  ..@ active.ident: Factor w/ 1 level "SeuratProject": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:11909] "AAACAGCCAAGGAATC-1" "AAACAGCCAATCCCTT-1" "AAACAGCCAATGCGCT-1" "AAACAGCCACACTAAT-1" ...
  ..@ graphs      : list()
  ..@ neighbors   : list()
  ..@ reductions  : list()
  ..@ images      : list()
  ..@ project.name: chr "SeuratProject"
  ..@ misc        : list()
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 5 0 0
  ..@ commands    : list()
  ..@ tools       : list()

👆🏾 pbmc@assays$RNA@data and pbmc@assays$RNA@counts contain both the raw counts. Only after NormalizeData() that pbmc@assays$RNA@data will store normalized data.

Quality control: the idea behind the metrics…

When you import the data Seurat automatically calculate the number of detected features per cell and the total counts par cell. These data are present in the pbmc@meta.data section or accessible just with pbmc$.

colnames(pbmc@meta.data)
[1] "orig.ident"    "nCount_RNA"    "nFeature_RNA"  "nCount_ATAC"  
[5] "nFeature_ATAC"

What are we looking for:

  • non-empty droplets

  • no (few) doublets

  • non-dying or alive cells

What metrics would help us discard unwanted samples (cells):

  • total detected genes

  • total RNA (ATAC fragments) detected

  • percentage of mitochondrial genes (dying cells would have their membranes disrupted causing nuclear transcripts to leak)

  • some suggest ribosomal genes as well could represent a significantly higher percentage…

  • Doublets could be considered as samples that demonstrate significantly higher number of genes detected or RNA counts (ATAC fragments). The are some how outliers when you look at the distribution of nCounts_RNA and nFeatures_RNA, but they can also be detected using scrublet that creates artificial doublets and label cells as “doublets” if their profile resembles that of the artificial doublets.

For the ATAC data:

  • Nucleosomal signals: sizes of ATAC fragments should display a strong enrichment at the sizes of n * nucleosomes (n = c(1,2,3 …), around 200, 400, 600 …). The ratio of mononucleosomal fragments to nucleosome-free fragments could be calculated using NucleosomeSignal(). Dying cells would have more naked or unwrapped DNA.

  • TSS enrichment score: TSSEnrichment() calculates ATAC signal ratio between at TSS sites and at TSS flanking regions. Low TSS enrichment score could suggest low expression and so the signal is very comparable to the “background” signal…

  • You can also add FriP (fragments in peaks) and fragments in blacklisted regions to see how specific your ATAC experiment is towards open chromatin regions…

Let’s make the quality measurements

Percentage of mitochondrial RNA molecules. You can do the same if you wish to consider ribosomal genes as quality criterion (or probably discard their effect in your clustering etc).

DefaultAssay(pbmc) ="RNA"
mito.genes = grep(pattern = "^MT-", 
                  x = rownames(x = pbmc@assays$RNA@data), value = TRUE)
# pattern = "(^RPL|^RPS|^MRP)" for ribosomal genes in humans
percent.mito <- Matrix::colSums(pbmc@assays$RNA@counts[mito.genes, ])/
  Matrix::colSums(pbmc@assays$RNA@counts)
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mt")

Let’s make the 2 major metrics for the ATAC data…

DefaultAssay(pbmc) ="ATAC"
pbmc <- NucleosomeSignal(object = pbmc)
pbmc <- TSSEnrichment(object = pbmc, fast = TRUE)
saveRDS(pbmc,"/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_preprocessed.RDATA")
pbmc = readRDS("/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_preprocessed.RDATA")
DensityScatter(pbmc, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)

pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 3, 'High', 'Low')
# The nect line isn't working, no idea why
# TSSPlot(pbmc, group.by = 'high.tss',assay = "ATAC") + NoLegend()
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')

Finding doublets with scrublet

This part is from this site

library(reticulate)
use_condaenv(condaenv = "singlecell",required = TRUE)
# dir.create("doublets")
# exprData = Matrix::Matrix(as.matrix(pbmc@assays$RNA@counts),sparse = TRUE)
# Matrix::writeMM(exprData,"doublets/matrix.mtx")
import scrublet as scr
import scipy.io
import numpy as np
import os

#Load raw counts matrix and gene list
input_dir = '/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/'
counts_matrix = scipy.io.mmread(input_dir + 'doublets/matrix.mtx').T.tocsc()

#Initialize Scrublet object
scrub = scr.Scrublet(counts_matrix,
                     expected_doublet_rate=0.1,
                     sim_doublet_ratio=2,
                     n_neighbors = 8)

#Run the default pipeline
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=1, 
                                                          min_cells=3, 
                                                          min_gene_variability_pctl=85, 
                                                          n_prin_comps=25)
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.21
Detected doublet rate = 11.2%
Estimated detectable doublet fraction = 66.3%
Overall doublet rate:
    Expected   = 10.0%
    Estimated  = 16.8%
Elapsed time: 15.4 seconds
pbmc@meta.data$DoubletScore = py$doublet_scores
# Plot doublet score
ggplot(pbmc@meta.data, aes(x = DoubletScore, after_stat(ndensity))) +
  geom_histogram(bins = 200, colour ="lightgrey")+
  geom_vline(xintercept = 0.21, colour = "red", linetype = 2)+
  geom_vline(xintercept = 0.2, colour = "green", linetype = 2) # Manually set threshold

# Manually set threshold at doublet score to 0.15
pbmc@meta.data$Predicted_doublets <- ifelse(py$doublet_scores > 0.2, "Doublet","Singlet" )
table(pbmc@meta.data$Predicted_doublets)

Doublet Singlet 
   1329   10580 
library(cowplot)

# Seurat's dedicated function is failing Idk when rendering
# pbmc$percent.mt = PercentageFeatureSet(pbmc,pattern = "^MT-")


# ribo.genes <- grep(pattern = "(^RPL|^RPS|^MRP)", x = rownames(x = pbmc@assays$RNA@data), value = TRUE)

VlnPlot(pbmc,features = c("nFeature_RNA", "nCount_RNA", "percent.mt","nFeature_ATAC","nCount_ATAC"), group.by = "Predicted_doublets",ncol = 3)

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

Filtering cells

You can filter out genes that are expressed by a few number of cells, which could save some memory space and computation time, also remove some noises for PCA and clustering …

# This throws an error after wards so probably do this at the start or use 
# the thresholds when creating the Seurat object min.cells= 10
num.cells = rowSums(pbmc@assays$RNA@counts>0)
length(num.cells)
# at.least 10 cells express the gene
length(num.cells[num.cells>10])
pbmc@assays$RNA@data = pbmc@assays$RNA@data[names(num.cells[num.cells<10]),]
pbmc@assays$RNA@counts = pbmc@assays$RNA@counts[names(num.cells[num.cells<10]),]

num.cells = rowSums(pbmc@assays$ATAC@counts>0)
length(num.cells)
# at.least 3 cells have the peak, ATAC peak is 
# less conserved in terms of position
length(num.cells[num.cells>3])
pbmc@assays$ATAC@data = pbmc@assays$ATAC@data[names(num.cells[num.cells<3]),]
pbmc@assays$ATAC@counts = pbmc@assays$ATAC@counts[names(num.cells[num.cells<10]),]

Getting thresholds: You can do it either through visual inspection by setting manually cut-offs to exclude outliers. Look at your boxplots, distribution histograms, scatter plots etc. You can also find outliers using quantile() or the low/high Median Absolute Deviations (mad()) which estimates the deviations with regard to the median value.

library(ggExtra)
max.mito.thr <- median(pbmc$percent.mt) + 3*mad(pbmc$percent.mt)
# For mitochondrial percentages the lower the value, the better it is,
# so no lower bound threshold to calculate
# min.mito.thr <- median(pbmc$percent.mt) - 3*mad(pbmc$percent.mt)
thr_quantiles = quantile(pbmc$percent.mt,0.95)

p1 <- ggplot(pbmc@meta.data, aes(x=nFeature_RNA, y=percent.mt)) +
      geom_point() +
      geom_hline(aes(yintercept = thr_quantiles[1]), colour = "red", linetype = 2) +
      # geom_hline(aes(yintercept = thr_quantiles[1]), colour = "red", linetype = 2) +
      annotate(geom = "text", 
               label = paste0(as.numeric(
                 table(pbmc$percent.mt > thr_quantiles[1])[2]),
                              " cells removed\n",
                 as.numeric(
                   table(pbmc$percent.mt > thr_quantiles[1])[1]),
                              " cells remain"), x = 6000, y = 0.1)

ggMarginal(p1, type = "histogram", fill="lightgrey", bins=100) 

Same thing with a Seurat function

DensityScatter(pbmc,x = "nFeature_RNA",y="percent.mt",quantiles = TRUE,log_x = TRUE)

# library(ggExtra)
# library(cowplot)
# # Relation between nUMI and nGene detected
# p1 <- ggplot(pbmc@meta.data, aes(x=nCount_RNA, y=nFeature_RNA)) + geom_point() + geom_smooth(method="lm")
# p1 <- ggMarginal(p1, type = "histogram", fill="lightgrey")
# 
# p2 <- ggplot(pbmc@meta.data, aes(x=log10(nCount_RNA), y=log10(nFeature_RNA))) + geom_point() + geom_smooth(method="lm")
# p2 <- ggMarginal(p2, type = "histogram", fill="lightgrey")
# 
# plot_grid(plotlist = list(p1,p2), ncol=2, align='h', rel_widths = c(1, 1))

On the number of genes detected/ RNA molecules captured

# Set low and high thresholds on the number of detected genes
min.Genes.thr <- median(log10(pbmc$nFeature_RNA)) - 3*mad(log10(pbmc$nFeature_RNA))
max.Genes.thr <- median(log10(pbmc$nFeature_RNA)) + 3*mad(log10(pbmc$nFeature_RNA))

# Set high threshold on the number of transcripts
max.nUMI.thr <- median(log10(pbmc$nCount_RNA)) + 3*mad(log10(pbmc$nCount_RNA))
# Gene/UMI scatter plot before filtering


# Set low and high thresholds on the number of detected genes
min.Peaks.thr <- median(log10(pbmc$nFeature_ATAC)) - 3*mad(log10(pbmc$nFeature_ATAC))
max.Peaks.thr <- median(log10(pbmc$nFeature_ATAC)) + 3*mad(log10(pbmc$nFeature_ATAC))

# Set high threshold on the number of transcripts
max.ATAC_frag.thr <- median(log10(pbmc$nCount_ATAC)) + 3*mad(log10(pbmc$nCount_ATAC))
# Gene/UMI scatter plot before filtering


p1 <- ggplot(pbmc@meta.data, aes(x=log10(nCount_RNA), y=log10(nFeature_RNA),col=Predicted_doublets)) +
      geom_point() +
      geom_smooth(method="lm") +
      geom_hline(aes(yintercept = min.Genes.thr), colour = "green", linetype = 2) +
      geom_hline(aes(yintercept = max.Genes.thr), colour = "green", linetype = 2) +
      geom_vline(aes(xintercept = max.nUMI.thr), colour = "red", linetype = 2)
p2 <- ggplot(pbmc@meta.data, aes(x=log10(nCount_ATAC), y=log10(nFeature_ATAC),col=Predicted_doublets)) +
      geom_point() +
      geom_smooth(method="lm") +
      geom_hline(aes(yintercept = min.Peaks.thr), colour = "green", linetype = 2) +
      geom_hline(aes(yintercept = max.Peaks.thr), colour = "green", linetype = 2) +
      geom_vline(aes(xintercept = max.ATAC_frag.thr), colour = "red", linetype = 2)

patchwork::wrap_elements(ggMarginal(p1, type = "histogram", fill="lightgrey"))+
  patchwork::wrap_elements(ggMarginal(p2, type = "histogram", fill="lightgrey"))

Let’s stick with the quantile thresholds for today

nUMI.thrs = quantile(pbmc$nCount_RNA,c(0.05,0.95))
nGenes.thrs = quantile(pbmc$nFeature_RNA,c(0.05,0.95))
nFrags.thrs = quantile(pbmc$nCount_ATAC,c(0.05,0.95))
nPeaks.thrs = quantile(pbmc$nFeature_ATAC,c(0.05,0.95))
mito_thr = quantile(pbmc$percent.mt,0.95)
tss.thr = quantile(pbmc$TSS.enrichment,0.05)
nucleo.thr = quantile(pbmc$nucleosome_signal,0.95)
pbmc = subset(
  x= pbmc,
  subset = nCount_RNA < nUMI.thrs[2] &
    nCount_RNA > nUMI.thrs[1] &
    nFeature_RNA < nGenes.thrs[2] &
    nFeature_RNA > nGenes.thrs[1] & 
    percent.mt < mito_thr[1])
pbmc=subset(
  x=pbmc,
  subset = nCount_ATAC < nFrags.thrs[2] &
    nCount_ATAC > nFrags.thrs[1] &
    nFeature_ATAC < nPeaks.thrs[2] &
    nFeature_ATAC > nPeaks.thrs[1] &
    TSS.enrichment > tss.thr[1] &
    nucleosome_signal < 2
)
pbmc = pbmc[,which(pbmc$Predicted_doublets=="Singlet")]

SCTransform to normalize, scale and find variable features + remove cell cycle effects… + PCA

The 3 steps process that was recommended/required is now run with one command in Seurat. You can also regress out some undesired variations. These can be due to cell cycle stage differences between cells, sequencing depth, percentage of mitochondrial transcripts etc.

DefaultAssay(pbmc) = "RNA"
pbmc = CellCycleScoring(pbmc,s.features = Seurat::cc.genes.updated.2019$s.genes,g2m.features = Seurat::cc.genes$g2m.genes)
head(pbmc$Phase)
AAACAGCCAATCCCTT-1 AAACAGCCAATGCGCT-1 AAACAGCCACCAACCG-1 AAACAGCCAGGATAAC-1 
             "G2M"                "S"                "S"                "S" 
AAACAGCCAGTAGGTG-1 AAACAGCCAGTTTACG-1 
              "G1"               "G1" 
# DefaultAssay(pbmc) = "RNA"
pbmc = SCTransform(pbmc,vars.to.regress = "Phase")
names(pbmc)
[1] "RNA"  "ATAC" "SCT" 

There’s a new assay called “SCT” that is added to your object. SCTransform does not add the scaled data at the pbmc@assays@RNA@data anymore. That’s what the previous steps do (NormalizeData() ScaleData())

pbmc = RunPCA(pbmc)
names(pbmc@reductions)
[1] "pca"

Inorder to decide which components to keep just do an elbow plot and choose ndims below which adding components adds “insignificant” variation from the data…

ElbowPlot(pbmc,ndims = 50,reduction = "pca")

You can also perform JackStraw() to get the optimal number of PCs, but this takes way too much time…

VizDimLoadings(pbmc,dims = 1:2)

DimPlot(pbmc, reduction = "pca") + NoLegend()

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE,reduction = "pca",assays = "SCT")

Same process for the ATAC data except that you perform Latent Semantic Indexing (LSI) which is more adapted to the higher sparsity nature of ATAC-Seq data

DefaultAssay(pbmc) = "ATAC"
pbmc <- FindTopFeatures(pbmc, min.cutoff = 5)
pbmc <- RunTFIDF(pbmc)
# If SVD throws an error:
# Error in irlba(A = t(x = object), nv = n, work = irlba.work, tol = tol) : 
  # function 'as_cholmod_sparse' not provided by package 'Matrix'
   # install.packages("remotes")
   # remotes::install_version("Matrix", version = "1.6-1")
   # packageVersion("Matrix")
pbmc <- RunSVD(pbmc)
names(pbmc)
[1] "RNA"  "ATAC" "SCT"  "pca"  "lsi" 
names(pbmc@reductions)
[1] "pca" "lsi"

The first LSI dimension often correlates with sequencing coverage so may not be relevant to capture biological differences…

DepthCor(pbmc,n = 20,reduction = "lsi")

DepthCor(pbmc,reduction = "pca",n = 20)

VizDimLoadings(pbmc,reduction = "lsi",dims = 1:2)

Non-linear dimensionality reduction

set.seed(12345)
DefaultAssay(pbmc)="SCT"
pbmc = RunUMAP(pbmc,dims = 1:20,reduction = "pca",reduction.name = "UMAP_RNA",seed.use = 42)
DefaultAssay(pbmc)="ATAC"
pbmc = RunUMAP(pbmc,dims = 2:20,reduction = "lsi",reduction.name = "UMAP_ATAC",seed.use = 42)
DimPlot(pbmc,reduction = "UMAP_RNA",dims = 1:2)+NoLegend()

DefaultAssay(pbmc)="SCT"
pbmc = RunTSNE(pbmc,dims = 1:20,reduction = "pca",reduction.name = "TSNE_RNA")
DefaultAssay(pbmc)="ATAC"
pbmc = RunTSNE(pbmc,dims = 2:20,reduction = "lsi",reduction.name = "TSNE_ATAC")
names(pbmc@reductions)
[1] "pca"       "lsi"       "UMAP_RNA"  "UMAP_ATAC" "TSNE_RNA"  "TSNE_ATAC"
DimPlot(pbmc,reduction = "TSNE_RNA",dims = 1:2)+NoLegend()

Let’s take advantage of the bimodal nature of our data

FindMultiModalNeighbors() allows to apply the weighted nearest neighbor (WNN) method to compute a joint neighbor graph with both the DNA accessibility and gene expression information.

pbmc = FindMultiModalNeighbors(pbmc,
                        reduction.list = list("pca","lsi"),
                        dims.list = list(1:20,2:20),
                        #this will be the column name for this values in metadata
                        modality.weight.name = "wnn.weight",
                        weighted.nn.name = "weighted.nn"
                        )
pbmc <- RunUMAP(
  object = pbmc,
  nn.name = "weighted.nn",
  assay = "RNA",
  verbose = TRUE
)
DimPlot(pbmc, label = TRUE, repel = TRUE, reduction = "umap") + NoLegend()

DefaultAssay(pbmc) = "SCT"
FeaturePlot(pbmc,
            features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
            reduction = "umap")

# To get the ATAC peaks with the most loadings
Loadings(pbmc@reductions$lsi) %>% as.data.frame() %>% arrange(desc(LSI_3)) %>% head()
                               LSI_1       LSI_2       LSI_3        LSI_4
chr14-99223600-99254668 -0.007628470 0.012400576 0.010817589 -0.004434770
chr14-99255246-99275454 -0.007854201 0.012734525 0.010767515 -0.005447933
chr14-99181080-99219442 -0.007250585 0.011646868 0.010551247 -0.011898602
chr14-61319750-61346673 -0.008255471 0.008615372 0.010033276 -0.000902028
chr5-35850992-35860227  -0.007207034 0.010311381 0.009872415 -0.011560768
chr17-82125073-82129615 -0.006967733 0.011069493 0.009777427 -0.015393326
                                LSI_5         LSI_6        LSI_7         LSI_8
chr14-99223600-99254668 -0.0019489660 -0.0032077353 0.0017053735 -0.0006294127
chr14-99255246-99275454 -0.0011452165 -0.0002146083 0.0020921890 -0.0011122315
chr14-99181080-99219442 -0.0004983464  0.0010122050 0.0023871137  0.0004908350
chr14-61319750-61346673  0.0021863489  0.0040002183 0.0009511627 -0.0022178337
chr5-35850992-35860227  -0.0025353353 -0.0096210538 0.0024193772  0.0077770605
chr17-82125073-82129615  0.0006023086  0.0008706356 0.0008168991 -0.0012402559
                               LSI_9        LSI_10        LSI_11       LSI_12
chr14-99223600-99254668 -0.011132964 -0.0093843946  4.154657e-03  0.010863245
chr14-99255246-99275454 -0.009529399 -0.0043277224  2.526717e-03  0.007909527
chr14-99181080-99219442 -0.006025062 -0.0035667795  9.925949e-05  0.004619453
chr14-61319750-61346673 -0.011899630  0.0006717987 -1.442966e-03  0.003298536
chr5-35850992-35860227  -0.003192896 -0.0042330533  8.446166e-04 -0.006473115
chr17-82125073-82129615 -0.001138216 -0.0010613865  6.345181e-04  0.003638770
                               LSI_13       LSI_14        LSI_15       LSI_16
chr14-99223600-99254668  1.703736e-03 -0.002519142  0.0031849937  0.007487128
chr14-99255246-99275454  8.072692e-04  0.001693905  0.0013283432  0.004236698
chr14-99181080-99219442  1.416543e-04  0.002269181  0.0011377762  0.004697183
chr14-61319750-61346673  5.452953e-05  0.002416436 -0.0030667115  0.003494038
chr5-35850992-35860227   2.142510e-03  0.003819993  0.0037431024  0.004691436
chr17-82125073-82129615 -1.000601e-03  0.002131181  0.0009451103 -0.005870090
                              LSI_17        LSI_18       LSI_19       LSI_20
chr14-99223600-99254668  0.008775327 -0.0022655096 0.0019040193  0.002107986
chr14-99255246-99275454  0.005994769 -0.0032846444 0.0036079943  0.003713431
chr14-99181080-99219442  0.005153732 -0.0024304011 0.0033792659  0.002142957
chr14-61319750-61346673  0.003906380  0.0003291849 0.0003909113  0.002184818
chr5-35850992-35860227   0.004473689  0.0007874884 0.0016535765  0.001374156
chr17-82125073-82129615 -0.004170894  0.0026900899 0.0041327251 -0.001094097
                               LSI_21       LSI_22        LSI_23        LSI_24
chr14-99223600-99254668 -0.0027258564  0.003570945 -0.0001685325 -3.362386e-03
chr14-99255246-99275454 -0.0021679777  0.001982417 -0.0022438401  1.785066e-04
chr14-99181080-99219442 -0.0034227390  0.004966769  0.0048786504 -9.049408e-04
chr14-61319750-61346673 -0.0007488389  0.003899228  0.0021288767 -6.207149e-05
chr5-35850992-35860227  -0.0007100582  0.002036220 -0.0072692787  1.645252e-03
chr17-82125073-82129615  0.0021353513 -0.001518692 -0.0051345820  1.559087e-03
                              LSI_25      LSI_26        LSI_27        LSI_28
chr14-99223600-99254668  0.003327571 0.004468130 -0.0007320344 -0.0019470810
chr14-99255246-99275454  0.004322688 0.005138780 -0.0001081651 -0.0005448581
chr14-99181080-99219442  0.001014499 0.001004196  0.0010194374 -0.0018995653
chr14-61319750-61346673  0.001727161 0.003542329 -0.0003593447 -0.0023407094
chr5-35850992-35860227   0.004116394 0.009411230  0.0025895297  0.0026528320
chr17-82125073-82129615 -0.000595900 0.005102839  0.0005450004 -0.0003172489
                               LSI_29        LSI_30        LSI_31        LSI_32
chr14-99223600-99254668 -0.0011798850  2.715963e-04  0.0003942581  2.304012e-05
chr14-99255246-99275454 -0.0015856873 -5.313156e-04  0.0001624633 -2.568710e-04
chr14-99181080-99219442 -0.0014141593 -5.854104e-05 -0.0002861250 -1.386881e-03
chr14-61319750-61346673 -0.0014112333 -7.808850e-04  0.0010189535  6.686418e-04
chr5-35850992-35860227   0.0007217527 -1.897177e-03  0.0003149366 -3.012595e-04
chr17-82125073-82129615  0.0011188818 -1.365647e-03  0.0005466905 -1.076233e-03
                               LSI_33        LSI_34        LSI_35        LSI_36
chr14-99223600-99254668  0.0001412327  0.0001932593  0.0006947994 -9.321940e-05
chr14-99255246-99275454  0.0001851521  0.0004623644  0.0015618273 -1.290627e-05
chr14-99181080-99219442 -0.0001847523  0.0004700394  0.0012676623 -8.925335e-04
chr14-61319750-61346673 -0.0004508646 -0.0002006617  0.0004776256  1.162080e-03
chr5-35850992-35860227   0.0003109253  0.0005330136  0.0012439127  1.115258e-06
chr17-82125073-82129615  0.0003241424 -0.0004054085 -0.0001557881  9.052694e-04
                              LSI_37        LSI_38        LSI_39        LSI_40
chr14-99223600-99254668 4.377710e-04 -5.238493e-04  1.135363e-03 -0.0002413457
chr14-99255246-99275454 1.223332e-03 -7.350898e-05  1.117961e-03  0.0001002190
chr14-99181080-99219442 1.346749e-03  2.282050e-04 -3.533265e-04  0.0003079743
chr14-61319750-61346673 2.188553e-04  9.562389e-04  2.535752e-04 -0.0008771883
chr5-35850992-35860227  3.144855e-05 -2.758017e-04  8.248225e-05 -0.0016091318
chr17-82125073-82129615 1.023629e-03  1.058190e-03  1.187203e-03 -0.0009636582
                               LSI_41        LSI_42        LSI_43       LSI_44
chr14-99223600-99254668  0.0004511260 -0.0006693672  0.0002831568 0.0011905566
chr14-99255246-99275454  0.0003336158 -0.0008451023  0.0005354706 0.0017548762
chr14-99181080-99219442  0.0011121347 -0.0003464882  0.0010439380 0.0001249209
chr14-61319750-61346673  0.0006811728 -0.0014236307 -0.0011017758 0.0036515313
chr5-35850992-35860227  -0.0009144450 -0.0026082826  0.0012961058 0.0000653394
chr17-82125073-82129615 -0.0002182979 -0.0009851880  0.0011278336 0.0014216917
                               LSI_45        LSI_46        LSI_47        LSI_48
chr14-99223600-99254668 -8.206416e-04 -6.306682e-04 -0.0007835574  0.0007475609
chr14-99255246-99275454  6.052715e-06 -1.258104e-03  0.0002731729  0.0003949774
chr14-99181080-99219442 -7.480977e-05 -4.156499e-05  0.0001149593  0.0008087828
chr14-61319750-61346673 -1.462751e-03 -5.210960e-04  0.0011797973 -0.0022184098
chr5-35850992-35860227  -1.356139e-03 -1.103301e-03  0.0004045562  0.0005777876
chr17-82125073-82129615 -1.180264e-03 -3.957742e-04  0.0002440161  0.0007537265
                              LSI_49        LSI_50
chr14-99223600-99254668 2.118996e-04 -0.0001717247
chr14-99255246-99275454 3.111033e-04 -0.0003134512
chr14-99181080-99219442 5.517744e-04 -0.0007772940
chr14-61319750-61346673 9.639046e-04 -0.0003323627
chr5-35850992-35860227  8.189216e-05 -0.0006816183
chr17-82125073-82129615 1.169985e-03 -0.0010474237
DefaultAssay(pbmc) = "ATAC"
FeaturePlot(pbmc,
            features = c('chr14-99255246-99275454', 'chr20-50269694-50277398', 'chr14-99223600-99254668'),
            reduction = "umap")

WNN also finds by default k = 20 neighbors pbmc@neighbors$weighted.nn@nn.idx, but you can also construct a KNN graph based on the euclidean distance in PCA/LSI space.

Seurat::Neighbors(pbmc)
[1] "weighted.nn"
# FindNeighbors(pbmc,reduction = "pca",dims = 1:20)
# FindClusters requires a graph.name
# pbmc@graphs gives the graph names
pbmc = FindClusters(pbmc,resolution = 0.8,graph.name = "wsnn",)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 8517
Number of edges: 284299

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9024
Number of communities: 17
Elapsed time: 0 seconds
DimPlot(pbmc,reduction = "umap")

Annotate cell clusters

If you have markers from published data you can label your clusters based on a reference annotation. This is an example given at the Signac users guide page.

wget -O /home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_multimodal.h5seurat https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat
library(SeuratDisk)

# load PBMC reference
reference <- LoadH5Seurat("/home/robel/mount/NAS_CUVIER/grpCuvier/rtesfaye/M2_TP_2023/singlecell/pbmc_multimodal.h5seurat", assays = list("SCT" = "counts"), reductions = 'spca')
DefaultAssay(pbmc) <- "SCT"
# transfer cell type labels from reference to query
transfer_anchors <- FindTransferAnchors(
  reference = reference,
  query = pbmc,
  normalization.method = "SCT",
  reference.reduction = "spca",
  recompute.residuals = FALSE,
  dims = 1:50
)
predictions <- TransferData(
  anchorset = transfer_anchors, 
  refdata = reference$celltype.l2,
  weight.reduction = pbmc[['pca']],
  dims = 1:50
)

pbmc <- AddMetaData(
  object = pbmc,
  metadata = predictions
)

# set the cell identities to the cell type predictions
Idents(pbmc) <- "predicted.id"

# set a reasonable order for cell types to be displayed when plotting
levels(pbmc) <- c("CD4 Naive", "CD4 TCM", "CD4 CTL", "CD4 TEM", "CD4 Proliferating",
                  "CD8 Naive", "dnT",
                 "CD8 TEM", "CD8 TCM", "CD8 Proliferating", "MAIT", "NK", "NK_CD56bright",
                 "NK Proliferating", "gdT",
                 "Treg", "B naive", "B intermediate", "B memory", "Plasmablast",
                 "CD14 Mono", "CD16 Mono",
                 "cDC1", "cDC2", "pDC", "HSPC", "Eryth", "ASDC", "ILC", "Platelet")
DimPlot(pbmc,label = TRUE)

But the same functions can be used to transfer annotation labels between scRNA-Seq and scATAC-Seq data if you don’t have multiomic data. To do that you create a gene activity matrix using your scATAC-Seq data. Here we presume that the more a gene is accessible, the more it is expressed, so you quantify ATAC signals within the gene body and 2k upstream TSS sites of all genes. You create an “RNA” assay with this quantifications. You can the same process to integrate 2 different scRNA-Seq data with the same populations (to remove batch effects). There is a more detailed explanation on CCA (Canonical Correlation Analaysis), the method implemented by Seurat, on this website.

gene_activities = GeneActivity(pbmc@assays$ATAC)
pbmc[['RNA_atac']] <- CreateAssayObject(counts = gene.activities)
pbmc = SCTransform(pbmc,assay = "RNA_atac")
FindTransferAnchors(reference = pbmc@assays$RNA,reduction = "cca")
TransferData()#...

But you can also find markers of your clusters to annotate them manually by identifying genes/regulatory elements that define them. You can do pair-wise comparisons between your clusters or FindAllMarkers() to identify markers of each cluster (which takes quite some time). You can do this for both your gene expression matrix and ATAC peaks.

DefaultAssay(pbmc) = "SCT"
DE_genes = FindMarkers(pbmc,ident.1 = "CD14 Mono",ident.2 = "B naive")
# FindAllMarkers() for all clusters
# 

If you want other tools to integrate different data and probably correct batch effects there is also harmony and many more. ArchR to integrate scATAC and scRNA. They all have their limits and advantages, and they might not suit your questions and data.

Peaks to genes

In the Signac paper, there is a method presented to link genes to ATAC peaks. Briefly, the idea is to associate ATAC peaks and genes that are co-regulated across cells. A pearson correlation test is performed between the expression of a gene of interest and signal of ATAC peak found within a 500 Kb windows of the gene. To estimate the randomness of this correlation, a pearson correlation is performed between the gene of interest and other randomly selected ATAC peaks with the same characteristics as the tested ATAC peak but located on another chromosome. A z-score is computed using this background signal and the observed signal and then a one-sided z-test computes p. values of the association.

RegionStats() gives GC content in the ATAC peaks, as characterstics to consider for selecting background peaks.

DefaultAssay(pbmc) <- "ATAC"

# first compute the GC content for each peak
pbmc <- RegionStats(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38)

# link peaks to genes
pbmc <- LinkPeaks(
  object = pbmc,
  peak.assay = "ATAC",
  expression.assay = "SCT",
  genes.use = c("LYZ", "MS4A1")
)
idents.plot <- c("B naive", "B intermediate", "B memory",
                 "CD14 Mono", "CD16 Mono", "CD8 TEM", "CD8 Naive")

p1 <- CoveragePlot(
  object = pbmc,
  region = "MS4A1",
  features = "MS4A1",
  expression.assay = "SCT",
  idents = idents.plot,
  extend.upstream = 500,
  extend.downstream = 10000
)

p2 <- CoveragePlot(
  object = pbmc,
  region = "LYZ",
  features = "LYZ",
  expression.assay = "SCT",
  idents = idents.plot,
  extend.upstream = 8000,
  extend.downstream = 5000
)
patchwork::wrap_plots(p1, p2, ncol = 1)

Sources used to prepare this demo

Signac user’s guide on multiomic data

Signac user’s guide on scATAC-seq data

Seurat user’s guide on scATAC-seq data

some course I found from Institut Gustave Roussy

OSCA

Quality control with basic R packages and Seurat

hbctraining

For trajectory analysis:

Look for monocle

Also to use monocle3 on Seurat objects

velocyto: the documentation is minimal